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Abstract: We provide a determination of the isotriplet quark distribution from available 
deep-inelastic data using neural networks. We give a general introduction to the neural 
network approach to parton distributions, which provides a solution to the problem of 
constructing a faithful and unbiased probability distribution of parton densities based on 
available experimental information. We discuss in detail the techniques which are necessary 
in order to construct a Monte Carlo representation of the data, to construct and evolve 
neural parton distributions, and to train them in such a way that the correct statistical 
features of the data are reproduced. We present the results of the application of this method 
to the determination of the nonsinglet quark distribution up to next-to-next-to-leading 
order, and compare them with those obtained using other approaches. 
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1. Introduction 



The needs of precision physics at hadron colliders have determined a revolution in the 
approach to the determination of parton distributions functions (PDFs) over the last few 
years. While the Tevatron has been providing data for a variety of hard hadronic processes 
which establish the validity and accuracy of perturbative factorization and parton univer- 
sality to a level which is now comparable to that of precision tests of the standard model 
in the electroweak sector, LHC, now behind the corner, will require, essentially for the first 
time, a precision approach to the structure of the nucleon in the context of searches for new 
physics [1]. Over a decade of experimental effort, especially in deep-inelastic experiments, 
first and foremost at the HERA collider [2], but also from neutrino beams and with muon 
beams on a fixed target, has provided us with an unprecedented amount of information 
which makes such a precision approach possible. However, we have now reached a stage in 
which the development of new theoretical and phenomenological analysis tools is needed [3] . 

The main recent development in the determination of parton distributions has been 
the availability of sets of parton distributions with errors [4-6]. Previously, errors on PDFs 
where crudely estimated by varying some sets of parameters, or comparing different de- 
terminations, and generally considered to be negligible in comparison to other sources of 
theoretical or experimental error. Now, the simultaneous progress in higher-order theoret- 
ical calculations and experimental results has made this simple-minded approach obsolete: 
a bona fide estimate of the error on PDFs has become necessary. 

This is a difficult problem, not only because of the usual difficulties in accounting 
properly for theoretical uncertainties (such as higher order perturbative corrections) which 
are non-gaussian and hard to estimate, and in collecting and propagating uncertainties 
contained in large experimental covariance matrices, but also because a PDF set is a set 
of functions, and therefore one is faced with the problem of constructing an error — a 
probability measure — on a space of functions [7]. This is clearly an ill-posed problem, 
because one is trying to infer an infinite amount of information from a finite set of data 
points: it can be made tractable only by introducing some theoretical assumptions. 

The standard approach to this problem is based on the choice of a specific functional 
form: the infinite-dimensional problem of determining a function is projected onto the 
finite-dimensional space of parameters which determine the given parametrization. Errors 
on PDFs are then essentially error ellipsoids in the finite-dimensional parameter space, 
which at least in principle can be determined by standard covariance matrix techniques. 
Based on this approach, three sets of PDFs with errors have been produced over the last 
few years [4-6] . The most obvious shortcoming of this sort of approach is that the choice 
of a parametrization is potentially a source of bias. More subtle problems are related to 
non-gaussian errors (either theoretical and experimental) and incompatible data, which 
render a simple maximum likelihood fit impossible or unreliable. 

These difficulties are apparent in the available sets of PDFs with errors, especially 
when they are compared with each other. Indeed, the direct determination of error bands 
as one-sigma contours seems to be possible only if a limited set of data is fitted. In 
particular, straightforward determination of errors as one-sigma contours has succeeded 
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in a global PDF fit to deep-inelastic data only [4], but as soon as new data (specifically 
from the Drell-Yan process) are added to this fit, their errors must be suitably rescaled [2]. 
Global fits to all available data are instead typically constructed by determining a priori a 
suitable "tolerance criterion", by inspection of the quality of the fit to all available data [5]. 
For example, one may determine the tolerance criterion by considering the spread of 90% 
confidence intervals for various experiments, as one moves away from the minimum of the 
X 2 along eigenvectors of the Hessian matrix, and taking the envelope of the resulting ranges. 
In practice, this suggests that a reasonable estimate of the one-sigma contours for PDFs 
can be obtained by selecting an appropriately large value of Ax 2 , such as A% 2 = 100 [5,6]. 
Even so, results obtained using different parton sets for relevant physical observables, such 
as W+Higgs production at the LHC, are sometimes found to disagree within their stated 
error bands [2,8]. The origin of these difficulties is most likely a combination of several 
factors: inconsistencies between different pieces of experimental data, underestimation of 
some experimental errors, bias due to specific choices of parton parametrization, and non- 
gaussian errors, especially in the space of parameters of individual parton parametrizations. 

These difficulties have stimulated various proposals for new approaches to the deter- 
mination of parton distributions, in particular with the aim of minimizing the bias related 
to the choice of a specific functional form, and trying instead to reconstruct the probability 
density in the full functional space of parton distributions. An interesting suggestion [7] is 
to use Bayesian inference based on the data in order to update the existing prior knowledge 
of PDFs, as summarized e.g. by a Monte Carlo sample based on an existing parton set. 
Although promising preliminary results have been obtained, no parton set based on this 
approach has been published yet. 

An alternative suggestion has been presented in Ref. [9]. The basic idea is to combine 
a Monte Carlo sampling of the probability measure on the space of functions that one is 
trying to determine (as in the approach of Ref. [7]) with the use of neural networks as 
universal unbiased interpolating functions. In a Monte Carlo approach, the function with 
error — the up quark distribution, say — is given as a Monte Carlo sample of replicas 
of the function, so that any statistical property of the underlying distribution can be 
derived from the given sample. For example, the average value of the function at some 
point is simply found as the average over the replicas, its error as the variance and so on. 
In a neural network approach, each function in the sample in turn is given by a neural 
network which, when fed the values of the kinematic variables as input, returns the value 
of the function itself as output. The underlying idea is that neural networks can be used 
as universal unbiased interpolators: starting from a Monte Carlo representation of the 
probability density at the points where it is known because there are data, they can be 
used to produce a representation of the probability density everywhere in the data region, 
and even to study the extrapolation outside it, and its breakdown. 

In Refs. [9, 10] this strategy was tried off on a somewhat simpler problem, namely, the 
construction of a parametrization of existing data on the deep-inelastic structure function 
F2(x,Q 2 ) of the proton and neutron. In such case, one is only testing that the method 
can be used to construct a faithful representation of the probability density in a space of 
functions, based on the measurement of the function at a finite discrete number of points. 
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The method was proven to be fast and robust, to be amenable to detailed statistical studies, 
and to be in many respects superior to conventional parametrizations of structure functions 
based on a fixed functional form. However, the determination of a parton set involves the 
significant complication of having to go from one or more physical observables to a set of 
parton distributions. In order to achieve the result, one must tackle various problems, such 
as deconvolution of hard coefficient functions and resolution of evolution equations, and 
several further subtleties such as the treatment of higher twist corrections and heavy quark 
thresholds. All these issues are of course rather well understood in the context of parton 
determinations, but their implementation in a neural Monte Carlo framework is nontrivial, 
both in principle and in practice. 

In this paper, we present a first determination of the nonsinglet quark distribution 
from deep-inelastic data based on neural networks. This result is the first step towards the 
determination of a full parton distribution set. Furthermore, almost all of the technical 
complications required for the construction of a neural parton arise and have to be tackled 
already in the nonsinglet case. Hence, this paper is a comprehensive introduction to neural 
parton fitting. 

The paper is organized as follows. In Sect. 2 we review the approach of Refs. [9, 10] to 
the construction of a neural network parametrization of a function based on its experimental 
sampling at a finite set of points, we describe its application to the construction of a 
parton set, thereby outlining the general strategy of the method, and we summarize the 
main features of our approach and thus of our results. In Sect. 3 we construct a Monte 
Carlo representation of the information contained in the data, and test for its statistical 
accuracy. In Sect. 4 we discuss a novel strategy to determine the evolution of parton 
distributions and to combine them with perturbative coefficients in order to determine the 
structure functions, which combines the speed and reliability of Mellin iV-space evolution 
methods with the flexibility of x-space evolution methods with respect to the choice of 
parton parametrization. In Sect. 5 we turn to the neural networks which are used to 
parametrize parton distributions, we discuss their features, the training procedure whereby 
they are fitted to the data, and the way this training is stopped before the noise in the 
data starts affecting the results of the fit. In Sect. 6 we present our reference next-to- 
leading order determination of the nonsinglet quark distribution, we test for the accuracy 
of our estimate of central value and statistical uncertainty, and we verify the stability of 
our result upon the minimization strategy. We discuss in particular how we arrive at an 
accurate determination of the statistical properties (specifically the uncertainty) of this 
determination. Potential sources of theoretical uncertainty, in particular higher twists and 
higher orders, are discussed in Sect. 7, while in Sect. 8 our reference result is compared to 
results obtained at different perturbative orders (leading and next-to-next-to-leading) and 
to alternative existing determinations. 

2. General strategy 

The method developed in Ref. [9] provides a general framework for the parametrization 
of physical observables by means of neural networks. This method has since then been 
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successfully applied to a diverse variety of physical problems: structure functions [10], 
spectral functions for r decays [11], energy spectra of B decays [12], and cosmic ray neu- 
trino fluxes [13], thereby proving its flexibility and robustness. Its application to parton 
distributions is based on the same underlying concepts, but it is significantly more intri- 
cate for a variety of reasons to be discussed shortly, the most obvious being that parton 
distributions are not directly physical observable quantities. 

The general underlying strategy of this approach is summarized in the flow-chart of 
Fig. 1, and, as in Ref. [9], it involves two distinct stages in order to go from the data to 
the parton parametrization. In the first stage, a Monte Carlo sample of replicas of the 
data ("artificial data") is generated. These can be viewed as a sampling of the probability 
measure on the space of physical observables (structure functions, cross sections, etc.) at 
the discrete points where data exist. In the second stage one uses neural networks to inter- 
polate between points where data exist. When constructing a parton set, this second stage 
in turn consists of two sub-steps: the determination of physical observables from parton 
distributions ( "evolution" ) , and the comparison of the physical observables thus computed 
to the data in order to tune the best-fit form of input neural parton distribution ("training 
of the neural network"). Combining these two steps, the space of physical observables is 
mapped onto the space of parton distributions, so the experimental information on the 
former can be interpolated by neural networks in the latter. Let us now describe each 
stage in turn, both in general and in the specific case discussed in this paper. 

The starting experimental data in general consist of the determination of several phys- 
ical observables in distinct experiments, each of which provides the measurement of the 
relevant quantity at a discrete set of values of the kinematical variables. In general, there 
will be a nontrivial set of correlations between determination of different quantities: e.g., 
different observables determined at different points may be correlated both between observ- 
ables and between points. In the current nonsinglet fit, we will consider a single observable, 
namely the nonsinglet nucleon structure function defined as 

if s (x, Q 2 ) = F*{x, Q 2 ) - Fi(x, Q 2 ) , (2.1) 

in terms of the proton and deuteron structure functions. 

The purpose of the artificial data generation is to produce a Monte Carlo set of 'pseudo- 
data', i.e. iV rC p replicas of the original set of data points: 

^(art)(fc). k= 1,... , N rcp , i = l,...,N dat , (2.2) 

where Fi denotes one individual measurement: in our specific case, a measurement of 
F^ s (xj, Q 2 ). The iV re p sets of iVdat points are distributed according to an iVd a t -dimensional 
multi-gaussian distribution about the original points, with expectation values equal to the 
central experimental values, and error and covariance equal to the corresponding experi- 
mental quantities. Because the distribution of the experimental data coincides (for a flat 
prior) with the probability distribution of the value of the structure function at the points 
where it has been measured [14,15], this Monte Carlo set gives a sampling of the probabil- 
ity measure at those points. We can then generate arbitrarily many sets of pseudo-data, 
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Figure 1: Flow-chart for the construction of a neural parton set. 



and choose the number of sets A^-ep in such a way that the properties of the Monte Carlo 
sample reproduce those of the original data set to arbitrary accuracy. For example, we 
can check that averages, variance and covariance of the pseudo-data reproduce central val- 
ues and covariance matrix elements of the original data. This step is denoted by 'Tests 
exp-art' in Fig. 1. Note that experimental errors (including correlated systematics) must 
be treated as gaussian since this is what experimental collaborations provide: the relevant 
issue is whether an underlying non-gaussian distribution of best-fit parton distributions 
can be obtained as a result of the fitting procedure. Whereas this might be problematic if 
errors are determined as one-sigma contours for a given parametrization, the Monte Carlo 
method gives considerably more flexibility, as we now show. 

The second step consists of training Ar ep sets of neural networks. Each set contains all 
the parton distributions which are being determined, as a function of x at a given reference 
scale Qq, and is based on all the data in one single replica of the original data set. Thus, 
each set will contain Apjf parton distributions, determined from iV"d a t data. The number 
of independent parton distributions is 1 < N p( ^ < 13, and it will depend on the specific 
data set which is used in the parton determination. At most, there can be six quark and 
antiquark distributions and one gluon distribution, but in practice not all of these may be 
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accessible with a given set of data. Specifically, in the nonsinglet fit discussed in this paper 
we will consider the simplest possible case, in which each set contains only one parton 
distribution, namely the C-even nonsinglet quark distribution 

qm(x, Ql) = [u(x, Ql) + u(x, Q 2 ) - (d(x, Q 2 ) + d(x, Q 2 ))] . (2.3) 

As in any parton fit, in order to compare the parton distributions with the data, the 
PDFs must be evolved from the reference scale Qq to the scale at which data are given, and 
combined with hard partonic cross sections in order to obtain physical observables. Unlike 
in most parton fits (such as e.g. [4-6]), however, each parton distribution is parametrized 
by a neural network, rather than by a fixed functional form. Then, the N^t data in each 
replica are used to determine the iV p df neural networks of the corresponding set. At the 
end of the procedure, we end up with iV re p sets of iVpdf parton distributions, with each 
PDF given by a neural network. The iV re p replicas of each parton distribution provide the 
corresponding probability density: for example, the mean value of the parton distribution 
at the starting scale for a given value of x is found by averaging over the replicas, and the 
uncertainty on this value is the variance of the values given by the replicas. 

The fit of the parton distributions to each replica of the data is performed by maximum 
likelihood, by minimizing an error function, which coincides with the x 2 of the experimental 
points when compared to their theoretical determination obtained using the given set 
of parton distributions. The \ 2 ls computed by fully including the covariance matrix 
of the correlated experimental uncertainties (with a suitable treatment of normalization 
errors [14]), as we shall discuss in more detail in Sect. 5.1 below. Unlike in conventional 
determinations of parton distributions with errors [4-6], however, the covariance matrices 
of the best-fit parameters are irrelevant and need not be computed. The uncertainty on 
the final result is found from the variance of the Monte Carlo sample. This eliminates the 
problem of choosing the value of A% 2 which corresponds to a one-sigma contour in the 
space of parameters, a problem which becomes nontrivial in the presence of incompatible 
data, underestimated errors, or unknown systematics, and which therefore plagues current 
global fits [5,6]. 

Rather, one only has to make sure that each neural network provides a consistent fit to 
its corresponding replica. If the underlying data are incompatible or have underestimated 
errors, the best fit might be worse than one would expect with properly estimated gaussian 
errors — for instance in the presence of underestimated errors it will have typically a 
value of x 2 P er degree of freedom larger than one. If, on the contrary, experimental errors 
have been overestimated, the final x 2 can turn out to be smaller than one per degree of 
freedom. Neural networks are ideally suited for providing a fit in this situation, based on 
the reasonable assumption of smoothness: for example, incompatible data or data with 
underestimated errors will naturally be fitted less accurately by the neural network [9]. 
Also, this allows for non-gaussian behavior of experimental uncertainties. Indeed, whereas 
a neural network with sufficiently large architecture can fit any compatible set of data, the 
choice of a suitable stopping criterion ensures that only the information contained in the 
data is fitted, but not the noise. This can be done by verifying that the quality of the fit 
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of data which have been used in the fitting procedure is the same as the quality of the fit 
to data which have not been used for fitting. 

It is then possible to verify that each neural network has the correct statistical features. 
This step is denoted by 'Tests net-art' in Fig. 1. Finally, the self-consistency of the Monte 
Carlo sample can be tested in order to ascertain that it leads to consistent estimates of the 
uncertainty on the final parton distributions, for example by verifying that the value of the 
parton distribution extracted from different replicas indeed behaves as a random variable 
with the stated variance. This step is denoted by 'Tests net-net' in Fig. 1. This set of tests 
allow us to make sure that the Monte Carlo sample of neural networks provides a faithful 
and consistent representation of the information contained in the data on the probability 
measure in the space of parton distributions, and in particular that the value of parton 
distributions and their correlated uncertainties are correctly estimated. 

Because of the use of neural networks for the parametrization of the parton distribu- 
tions, two technical aspects of our fitting procedure differ from what is done in standard 
parton fits. The first is the method which is used in order to evolve parton distributions 
by solving the QCD evolution equations. Indeed, we solve evolution equations in Mellin 
moment space N [16], since this allows for a fast and accurate numerical solution. However, 
this method is usually applied to parton parametrizations such that the Mellin transform 
of the initial parton distribution can be computed analytically: a numerical approach is 
not viable because in the region where the inverse Mellin transform of the solution exists, 
the Mellin transform of the initial condition does not converge. But a neural network is a 
complicated nonlinear function whose Mellin transform cannot be determined analytically 
in closed form. The alternative of parametrizing directly the initial parton distributions 
in N moment space is unpalatable, because one would need a parametrization in the com- 
plex plane of the N variable, whereas neural networks are most conveniently defined on a 
compact space, such as < x < 1. 

Therefore, we are led to introduce a novel technique, which consists of determining 
the x-space evolution kernel (Green function) from the N space solution, in such a way 
that any parton distribution can be evolved by convolution with this universal kernel. 
This method combines the speed and reliability of the iV-space solution method, with the 
advantages of parametrizing the parton distribution in x space. Furthermore, because the 
kernel is universal (i.e. independent of the specific boundary condition which is evolved) it 
can be computed beforehand to arbitrarily high accuracy and stored; the result can then 
be used during the fitting procedure, thus leading to an extremely efficient evolution. This 
evolution method will be discussed in detail in Sect. 4. 

The other technical peculiarity which is motivated by the use of neural networks is the 
choice of the minimization algorithm. Because of the nonlinear dependence of the neural 
network on its parameters, and the nonlocal dependence of the measured quantities on the 
neural network (cross sections are expressed as convolution of the initial distribution with 
evolution kernels and coefficient functions), a genetic algorithm turns out to be the most 
efficient minimization method. The use of a genetic algorithm is particularly convenient 
when seeking a minimum in a very wide space with potentially many local minima, because 
the method handles a population of solutions rather than traversing a path in the space of 
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solutions. The minimization method will be discussed in detail in Sect. 5. 

At the end of the minimization we end up with a Monte Carlo sample of neural networks 
which provides our best representation of the measure in the space of parton distribution. 
Generally, we expect [9, 10] the uncertainty on our final result to be somewhat smaller 
than the uncertainty on the input data, because the information contained in several data 
points is combined. The properties of this measure can be tested against the input data by 
using it to compute means, variance and covariances which can be compared to the input 
experimental ones which have been used in the parton determination. Also, the stability 
of the result can be tested by excluding some data points from the fit and checking that 
the best-fit result predicts them correctly. These tests are denoted by 'Tests net-exp' in 
Fig. 1, and they conclude our determination. 

3. Experimental data and Monte Carlo generation 
3.1 Experimental data and kinematical cuts 

The nonsinglet parton distribution qi<ss(x, Qq) can be extracted from experimental data on 
proton and deuteron structure functions F^x, Q 2 ) and F^{x, Q 2 )- A detailed discussion of 
all the available data was given in Refs. [9,10]. In short, early SLAC data have extremely 
large uncertainties, while data from JLAB and the E665 collaboration [17] are mostly at 
low Q 2 and are almost entirely excluded by our kinematic cuts. This leaves us with the 
data published by the NMC [18] and BCDMS [19, 20] collaborations. We refer to [9] for 
a discussion of the kinematical and statistical features of these data, in particular for the 
details of their correlated systematics. 

Once the systematics are known, the experimental covariance matrix for each experi- 
ment can be easily computed 



<ri,p(7j,p + FiFja 2 N +6ija 2 s , (3.1) 



where Fi , F i are central values, c^p are the N SJS correlated systematic errors, a at is the total 
normalization uncertainty, and cr^ is the statistical uncertainty. The correlation matrix is 
given by 

0i,totCj,tot 

where the total error a^tot for the i-ih point is given by 

^,tot = ^ls + °lc + F?vl (3-3) 
and the total correlated uncertainty Uj iC is the sum of all correlated systematics 

N sys 

°lc = Y. a lv (3-4) 
P =i 

In order to keep higher twist corrections under control, only data with Q 2 > 3 GeV 2 
are retained. No separate cut in the invariant mass of the final state W 2 turns out to be 
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Figure 2: Kincmatical coverage of the available experimental data on the nonsinglet structure 
function in the (x,Q 2 ) plane. 
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Table 1: Experimental data included in this analysis. All values of a and cov are given as 
percentages. Averages over all data points are noted with brackets (). 



necessary. A discussion on the motivation for this choice of cuts and the possible effect of 
their variation will be given in Sect. 6. 

A scatter plot of the NMC and BCDMS data in the (x, Q 2 ) plane is displayed in 
Fig. 2, with points excluded due to the Q 2 cut marked with diamonds. Note that all these 
excluded data points belong to the NMC experiment. In Table 1, the statistical features 
of the data (after the cut) are summarized. 

3.2 Monte Carlo generation 

The first step in our approach, as discussed in Sect. 2, is to generate a Monte Carlo sample 
of replicas of the experimental data. Each Monte Carlo replica of the original experimental 
data is generated as a multi-gaussian distribution. More precisely, for each data point 
p(exp) _ pNS^ x .^ q2^ we g enera te k = 1, . . . , A^p artificial points F^ art ^ k ^ as follows 

/ Wor. \ 

if rt)(fc) = (l + r£W) U (GXP) + £ rW*i, p + 4%, A ,k = l,...,N iep , (3.5) 

using independent univariate gaussian random numbers r( fc ) for each independent error 
source. 

The value of N Tep is determined in such a way that the Monte Carlo set of replicas 
models faithfully the probability distribution of the data in the original set. A quantitative 
comparison can be performed by defining suitable statistical estimators (see Appendix 
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Table 2: Comparison between experimental and Monte Carlo data. 
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B). Results are presented in Table 2. The results show that a sample of 1000 replicas is 
sufficient to ensure average scatter correlations of 99% and accuracies of a few percent on 
structure functions, errors and correlations. 

Normalization errors are not included in the covariance matrix on the same footing 
as other sources of systematics since, as by now well known [14], this would bias the fit. 
Rather, normalization errors are included by rescaling all errors (statistical cij iS and each 
systematic <7j iP ) independently for each replica by a factor 

_(fc) n . (k) X 

w \ k p = (! + r N )(7 N)vi, P p=l N SJS , (3.6) 

where r$ is the random variable Eq. (3.5) associated to the normalization uncertainty. 
The covariance matrix is then given by 

= ( E ° {k) i*° {k) 3, P \ + 5ij* {k)2 i,s , (3.7) 
in terms of the rescaled uncertainties. 



4. Prom parton distributions to physical observables 

The nonsinglet structure function i ? |^ s (x,Q 2 ) Eq. (2.1) is determined from the nonsinglet 
quark distribution qNs( x , Qo) Eq. (2.3) at a reference scale Q$ by first evolving the parton 
distribution to the scale Q 2 by means of QCD evolution equations, and then convoluting 
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it with the appropriate hard coefficient function C^s(x,a s (Q 2 ))- 

^2 NS (^Q 2 ) = ^/ 1 -CNs(y,a s (Q 2 ))9Ns(-,g 2 ) • (4.1) 

j x y v y j 

Both coefficient functions and the anomalous dimensions which drive perturbative evolution 
are now known up to next-to-next-to leading order (and in fact the coefficient function up 
to NNNLO [21]). Therefore, we will provide determinations of the parton distributions at 
leading (LO), next-to-leading (NLO) and next-to-next-to leading order (NNLO). In this 
section we will discuss in detail the methods that we use in order to actually determine the 
structure function from the input parton distribution, specifically the resolution of evolution 
equations. Only the nonsinglet case will be discussed explicitly, while the (straightforward) 
generalization to the singlet case will be discussed in a forthcoming publication. 

4.1 Leading-twist evolution and coefficient functions 

As well known, QCD evolution equations are most easily written for Mellin moments of 
the parton distributions, defined as 



qNs(N,Q 2 )= Cdxx'^'q^x.Q 1 ) , (4.2) 
J 

where by slight abuse of notation we denote the function and its transform with the same 
symbol. They take the form 

qm(N,Q 2 ) = ^!l 7NS (Ar, as (Q 2 ))q m (N,Q 2 ) , (4.3) 



dlnQ 2 ™"^ ' 4vr 
where the anomalous dimension can be expanded in powers of a s (Q 2 ) as 

7Ns(iV, a s (Q 2 )) = ^ S (N) + ^7S(A0 + \™(N) + . . . (4.4) 

The NNLO contribution has been computed recently in Ref. [22]. Defining analogously the 
Mellin transforms of coefficient function and structure function, the latter is given by 

if s (iV - l,Q 2 ) = ^C m (N,a s (Q 2 ))g m (N, Q 2 ) , (4.5) 
with the perturbative expansion of the coefficient function 

C NS (N, a s (Q 2 )) = 1 + ^f±C^(N) + (^P^J 2 C^(N) + .... (4.6) 

It appears therefore especially convenient [16, 23] to determine the structure function 
in Mellin space and then invert the Mellin transform, since the N space expression of 
the structure function can be determined in closed form, and the problem is reduced to 
the computation of the single Mellin inversion integral. In x-space instead one has to 
solve an integro-differential equation. Even though efficient numerical methods have been 
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developed for the resolution of the integro-differential form of the evolution equation (see 
e.g. [24,25]), the only genuine advantage of an x-space approach is the possibility to work 
directly with an x-space parametrization of parton distributions. As discussed in Sect. 2, 
this option is especially convenient when using neural networks, since these are most easily 
defined as function on a compact space such as < x < 1. 

However, we observe that the solution to the evolution equation (4.3) has the form 

qm(N, Q 2 ) = T(N, a s (Q 2 ) , a s {Q 2 ))qm(N, Q 2 ) , (4.7) 

where the evolution kernel T(N, a s (Q 2 ) , a s (Qq)) is entirely determined by the anomalous 
dimension 7ns {N, a s (Q 2 )), and it does not depend on the boundary condition qisis(N, Qq). 
Hence, if we take an inverse Mellin transform of Eq. (4.7) we get 

qm(x,Q 2 )= f 1 ^T(y,a s (Q 2 ) ,a s (Q 2 ))q m (-,Q 2 ) . (4.8) 

Jx y \y / 

We can then determine the kernel T(x, a s {Q 2 ) , a s (Qq)) explicitly by Mellin inversion, and 
use it to evolve any x-space parton distribution using Eq. (4.8). The only complication is 
related to the large- x behavior of the kernel T(x, a s (Q 2 ) , a s (Qo))- 

Furthermore, we can express the structure function in terms of the initial quark dis- 
tribution as 

F 2 NS (iV - 1, Q 2 ) = if (AT, a s (Q 2 ) , a s {Q 2 )) q NS (N, Q 2 ) , (4.9) 

where 

T{N,a, {Q 2 ),a s {Q 2 )) = C NS (N,a s {Q 2 ))T(N,a s (Q 2 ) ,a s (Q 2 )), (4.10) 

i.e. absorbing the coefficient function in a redefined evolution kernel. Again, we can 
determine the Mellin inverse of this redefined kernel, and use it to obtain the measured 
structure function from the initial x-space parton distribution as 

F^(x,Q 2 ) = \x f 1 ^f(y, as (Q 2 ) ,a s (Q 2 )) qNS (-,Q 2 ) . (4.11) 

j x y v y j 

The evolution kernels can be computed, interpolated to arbitrary accuracy and stored, 
and then used to evolve the initial PDF or determine the structure function from it through 
the computation of a single real convolution integral, eqs. (4.8) and (4.11). 

4.2 Solution of the evolution equation in iV space 

We determine the evolution kernel up to NNLO in terms of the running coupling, for which 
we take the expanded solution of the renormalization group equation, i.e., up to NNLO 



a s (Q 2 ) = a s (Q 2 ) 



LO 



1 + a s (Q 2 ) LQ [a s (Q 2 ) LQ - a s (M 2 )] (b 2 - b 2 ) 



+ a s {Q 2 ) NLO h In 



a s (Q 2 
XL<> " 1 U1 - (Mf) 



f_V^ )NLO 
a. 



(4.12) 
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with 



a. 



(Q 2 ) 



NLO 0s JLO 



(Q 2 ), 



1 - b ± a s (Q 2 ) LQ In (l + o a s (M 2 ) In $L 



«. (Q 2 )lo 



«- (Ml) 



l + /3 a s (Mf) In 
and the beta function coefficients given by 

2 



(4.13) 
(4.14) 



dQ 2 



a s (Q 



2, _ <*s(Q 2 ) 



k=0 



Air 



where 



0o = 11 - -TV, , 
38 

A = 102 - — N f , 

2857 5033 325 ~ 

02 = N f H N 2 f , 

H 2 18 i 54 / ' 



(4.15) 

(4.16) 
(4.17) 
(4.18) 



and bi = 0i/0o- We will use the current PDG [26] average value of a s (Q 2 ), namely 

a s {M 2 z ) = 0.118 ± 0.002 (4.19) 

with M z = 91.187 GeV. 

The evolution factor is then explicitly given by 



T(N,a s {Q 2 ),a s (Q 2 ,)) = 



a s (Q 2 ) 



- T (0)(jv)//3 



+ 



(a 2 (Q 2 ) - a 2 (Q 2 )) [$>(JV) - a s (Q 2 )a s (Q 2 ) (u^(N)) 



l-{a s (Q 2 )-a s (Q 2 ))U^(N) 



(4.20) 



where (adopting the notations and conventions of Ref. [23]) we have introduced nonsinglet 
evolution coefficients 



U^(N) = -±-(^(N)-b^ 0) (N)) , 



(4.21) 



(4.22) 



At NNLO, both the strong coupling and the parton distributions are discontinuous 
when crossing heavy quark thresholds. We set the thresholds m c = 1.4 GeV, m& = 4.5 GeV 
and mt = 175 GeV with, at NNLO, the discontinuity 



(C \ 2 

a sJ+1 (m 2 f ) = a sJ {m 2 f ) + (^J a sJ {m)f , 



(4.23) 
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Figure 3: The path in the complex N-space followed by the Talbot integration path, Eq. (4.26) 
for r = 1. 



where C2 = 14/3 [27]. We match parton distributions at threshold as 

& +1 \N,ml) = q ^\N,ml). (4.24) 

Heavy quark mass terms are included in the hard coefficients, and in the nonsinglet sector 
they vanish up to NLO, while at NNLO they are not known (though they can be approxi- 
mated from their large x behavior [28]). Therefore, we defer the inclusion of heavy quark 
mass effects to a forthcoming full fit including the singlet contribution, hence, for all prac- 
tical purposes, we use a zero-mass variable flavour number scheme in the nonsinglet sector. 
Note that because of this, the nonsinglet structure function nas a NNLO discontinuity 
at the heavy quark thresholds. 

4.3 Evolution kernel in x-space 

The x-space evolution factor is determined by numerical computation of the Mellin inverse 

T(x, a s (Q 2 ) , a s (Q 2 ) ) = / X -^T(N, a s (Q 2 ) , a s (Q 2 ) ) , (4.25) 

of the N space evolution factor T(N, a s (Q 2 ) ,a s (Qq)), Eq. (4.20), as well as of the as- 
sociate T(N,a s (Q 2 ) ,a s (Qq)), Eq. (4.10). The numerical computation of the Mellin in- 
version integral is delicate, because the oscillatory behavior of the integrand at large x is 
not damped by multiplication by an initial PDF. This problem is mitigated by a suitable 
choice of integration path. We choose the Talbot path, defined by the condition 

JV(0) = r0(l/tan0 + i), -vr < 9 < n , (4.26) 

with r a constant, shown in Fig. 3. 

To further improve the numerical efficiency the Fixed Talbot algorithm can be used [29] 
where the integral is replaced by the sum 



T 



Af-1 



\y (N = r) x~ r + ^ Re \x- N ^T (N (6 k )) (1 + ia (6 k )) 



k=l 



(4.27) 
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where a (9) = 6 + (9/ tan 9 - 1) /tan 6, and 9 k = kir/M, and r = 2M/ (5\nl/x). It can be 
shown that M is the relative accuracy, i.e. the number of accurate digits. We shall take 
M = 16. The x— space evolution kernel T(x) determined thus is displayed in Fig. 4, at 
different perturbative orders and in Fig. 5 for different values of Q 2 . 




0.0 
10' 



Q =10 GeV= 
Q 2 =100 GeV 2 
Q z = 10000 GeV z 




Figure 4: Non-singlet evolution kernel T(x) 
computed at different perturbative orders in 
the kinematical region relevant to nonsinglet 
evolution. The evolution scales are Q 2 = 2 
GeV 2 and Q 2 = 10 4 GeV 2 . 



Figure 5: Same as Fig. 4 for the LO depen- 
dence of the evolution factor on the evolution 
length Q 2 . As in the previous case, Qq = 2 
GeV 2 . 



The determination of the evolution kernel presented so far only holds for x < 1. As 
well known (see Appendix A), the Mellin inverse of the anomalous dimension Eq. (4.4) 
only exists for x < 1, while at x = 1 it behaves as a distribution. Because the N- 
space evolution kernel is obtained by exponentiating the anomalous dimension, its Mellin 
transform must be defined through its action on a test function. In particular, we must 
check that integration over the kernel converges as x — > 1. To this purpose, we define a 
subtracted kernel T^ + \x,a s (Q 2 ) ,« s (Qq)) 

T(x, a s (Q 2 ) , a s (Q 2 ) ) = r« (x, a s (Q 2 ) , a s (Q 2 ,) ) + G5(l - x) , (4.28) 
T( + \x,a s {Q 2 ),a s (Q 2 )) = T(x, a s (Q 2 ) , a s (Q 2 )) - G5(l - x). (4.29) 

The constant G is defined as 

G= I dxT(x,a s {Q 2 ),a s {Q 2 )) = T(N,a s {Q 2 ),a s {Q 2 )) , (4.30) 
Jo N=1 

where T(N,a s (Q 2 ) ,ct s (Qq)) is given by Eq. (4.20), and it converges at N = 1. 

The evolution equation, Eq.( 4.8), in terms of the subtracted kernel takes the form 

q(x,Q 2 ) = Gq(x,Q 2 ) + f ^r(+)(y,a s (Q 2 ) ,a s {Q 2 ))q (-,Q 2 

J x y \y 

G- I dyT(y,a s {Q 2 ),a s (Q 2 )) 



q(x,Ql) 



+ 



1 ^T(y,a s (Q 2 ) ,a s (Q 2 )) (q (-,Q 2 ) yq [.r.Ql) 

x y v v y 



(4.31) 
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It is easy to prove that all integrals in Eq. (4.31) converge and can be computed numerically, 
exploiting the fact that the behavior of the evolution kernel in the large x limit is known 
from soft gluon resummation arguments. An explicit proof is given in Appendix A. 

Hence we will use Eq. (4.31) to determine the evolution of the quark distribution. The 
numerical accuracy of this method is tested by comparing it to the benchmark evolution 
tables first presented in Ref. [30] and recently updated including the full NNLO anomalous 
dimension in Ref. [2]. In order to allow for a direct comparison to the benchmark tables, we 
modify some of our final choices in the solution of the evolution equations. Specifically, we 
fix the number of flavors (to n/ = 4) and we keep the NLO contributions to the evolution 
factor Eq. (4.20) in unexpanded form. With these choices, we determine the evolution of 
the u valence parton distribution. The results and the accuracy of this benchmark are 
shown in Table 3, where we use the same parameters as in [2,30]. More details about 
the benchmarking of QCD evolution codes can be found in Ref. [2]. We find an accuracy 
of order O (lO -5 ), comparable to that of current evolution codes. The same result has 
been obtained for the d valence distribution. Comparable accuracy is expected in C-even 
evolution, relevant for our paper (not included in the benchmark [30]) 

The solution of the evolution equation through the determination of an evolution 
factor is particularly efficient because of the universality of the evolution factor itself, 
i.e., its independence of the specific boundary condition which is being evolved. Hence, 
the evolution factor can be precomputed and stored, and then used during the process of 
parton fitting or when evolving different parton distributions, without having to recompute 
it each time. 

This fact can be exploited in an optimal way during parton fitting and parton evolution. 
In parton fitting, a given boundary condition must be evolved many times up to the fixed 
pairs of values of (x,Q 2 ) at which data are available (in our case, those shown in Fig. 2). 
Namely, for the i-th data point, one must compute q{xi,Q 2 ) given by Eq. (4.31). The 
constant G Eq. (4.30) (which depends only on Q 2 ) can be precomputed and stored for all 
required values of Q 2 . Furthermore, the numerical determination of the integral over y on 
the right-hand side of Eq. (4.31) involves the determination of the integrand at a set of 
values of y = y^i, which in turn depend on the given values of Xi,Q 2 : y^ = yki{xi,Q 2 ). 
The integrand is fully determined by knowledge of the following values of initial parton 
distribution and of the evolution kernel: 

r ik = T{y ik ,a s {Q 2 ),a s (Ql)) (4.32) 

Qik = q f — , Qo ) ~ Vikqixi, Qq). (4.33) 

The most computing-intensive task is the determination of the values Eq. (4.32) of 
evolution kernel. We have precomputed and stored these values for all the necessary values 
of Xi, Qf and yik- In practice, we use gaussian integration with Apt = 16 (2 7Vit+1 — l) points 
and determine the values of y accordingly. In Table 4 we show the percentage deviation e 
of the left-hand side of Eq. (4.31), when compared to what is obtained from the use of a 
library routine with nominal percentage accuracy of 10 -5 . It is apparent that = 4, i.e. 
integration with about 500 points is sufficient to reach this accuracy. 
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This method is of course only convenient when evolving up to a fixed set of points in 
the (x, Q 2 ) plane. When evolving up a parton distribution, e.g. in order to input it to some 
other computation, it is instead convenient to precompute a full interpolation of the kernel 
T(x, a s (Q 2 ) , a s (Qq))- Evolution is then reduced to the determination of the convolution 
Eq. (4.31) of this kernel with the parton distribution. This is somewhat less efficient than 
the previous method because it requires the computation of this convolution over the in- 
terpolation each time a point is evolved, but it already avoids the main bottleneck, namely, 
the Mellin inversion of the kernel. We have interpolated the kernel T(x,a s (Q 2 ) ,a s (Qq)) 
and its integral up to x which appears in the first term of Eq. (4.31), using Chebyshev 
polynomials. 

In order to speed up the interpolation, the leading large x behavior determined at 
each perturbative order (as described in Appendix A) is divided out before interpolating, 
which considerably smoothes the large x growth of T(x, a s (Q 2 ) , ct s (Qo))- We have found 
that the use of 200 polynomials for the interpolation considered above produces results 
with a precision of O (l0~ 5 ), for all the (x,Q 2 ) range covered by experimental data. This 
accuracy is enough for present purposes and it could be improved by increasing the number 
of polynomials. In Table 5 we compare results obtained with the interpolated evolution 
factors to the exact evolution for different values of x, with the input parton distribution 
set equal to our best-fit nonsinglet neural parton distribution, to be described in Sect. 6. 

We shall choose Qq = 2 GeV 2 as a starting scale for perturbative evolution. We have 
explicitly checked the independence of the results with respect this choice. 

4.4 Target mass corrections and higher twists 

Even though we discard data at very low Q 2 data in order to minimize the impact of 
higher twist corrections, target-mass corrections [31], which are the dominant higher twist 
corrections and are known in closed form should be included in order to increase the 
accuracy in the low-Q 2 region. The twist-four (i.e. next-to-leading twist) structure function 
with the inclusion of target-mass corrections is given by 

t 2 (X,Q ) - -jj^ -2 h D— j— i2(?TMC, Q ) , ( 4 -<3 4 ) 

where 

mTMC,Q 2 )=f 1 dz F ^^ Q2) , (4.35) 
J £tmc 



2 % n 4M 2 x 2 

^ MC = IT7^ r = 1 + ^' (436) 

where F2 T {x,Q 2 ) is the leading-twist expression Eq. (4.1). 

Fitting directly Eq. (4.34) to the data is impractical, because the nonsinglet quark 
distribution to be determined appears in the I2 integral. Rather, it is more convenient 
to view the contribution proportional to I2 in Eq. (4.34) as a correction to be applied to 
the data. Namely, we re-express -F 2 LT in the I2 integral in terms of the full F2, i.e., to the 
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next-to-leading twist level we simply replace F 2 LT in I2 with F2. Therefore, in practice, the 
function 

r ?TMC 

is fitted to the corrected data: 

F 2 dat " TMC (x, Q 2 ) ee F 2 dat (z, Q 2 ) - 6^ jf * ^F dat (z, Q 2 ) . (4.38) 

The integral over the experimentally measured structure function can be determined 
using the interpolation of the data based on neural networks which was constructed in 
Ref. [9]. Each pseudo-data point is then corrected using Eq. (4.38), and the fit procedes as 
outlined in the previous sections, with F2 replaced by $2- 

Further higher-twist corrections are due to the contribution of subleading twist oper- 
ators in the Wilson expansion. A possible next-to-leading twist contribution from these 
terms can be parametrized as 

F 2 NLT (x, Q 2 ) = F 2 LT (x, Q 2 ) (l + . (4.39) 

We will then assess the size of these corrections by parametrizing the function HT{x) with 
a neural network and fitting it to the data, as we will discuss in Sect. 7.2. 
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xu v (x,Ql) (LH) I xu v (x,Ql) (NNPDF) Rel. error 



Leading order 



10~ 7 


5 7722 10" 


-5 


5 7722 10" 


-5 


3 3760 10" 


-6 


1(T 6 


3.3373 10" 


-4 


3.3373 10" 


-4 


1.6880 10" 


-6 


10~ 5 


1.8724 10" 


-3 


1.8724 10" 


-3 


1.9212 10" 


-6 


10~ 4 


1.0057 10" 


-2 


1.0057 10" 


-2 


1.4095 10" 


-6 


10~ 3 


5.0392 10" 


-2 


5.0392 10" 


-2 


2.6145 10" 


-6 


10- 2 


2.1955 10" 


-1 


2.1955 10" 


-1 


3.1065 10" 


-6 


0.1 


5.7267 10" 


-1 


5.7267 10" 


-1 


6.4524 10" 


-6 


0.3 


3.7925 10" 


-1 


3.7925 10" 


-1 


9.2674 10" 


-6 


0.5 


1.3476 10" 


-1 


1.3476 10" 


-1 


1.1307 10" 


-5 


0.7 


2.3123 10" 


-2 


2.3122 10" 


-2 


2.1165 10" 


-5 


0.9 


4.3443 10" 


-4 


4.3440 10" 


-4 


6.3630 10" 


-5 



Next-to-Leading order 



10- 7 


1.0616 10" 


-4 


1.0616 10" 


-4 


2.1462 10" 


-6 


10~ 6 


5.4177 10" 


-4 


5.4177 10" 


-4 


8.7799 10" 


-6 


10" 5 


2.6870 10" 


-3 


2.6870 10" 


-3 


9.7796 10" 


-6 


10- 4 


1.2841 10" 


-2 


1.2841 10" 


-2 


1.3380 10" 


-5 


10- 3 


5.7926 10" 


-2 


5.7926 10" 


-2 


8.5063 10" 


-6 


10~ 2 


2.3026 10" 


-1 


2.3026 10" 


-1 


3.0757 10" 


-7 


0.1 


5.5452 10" 


-1 


5.5452 10" 


-1 


7.6419 10" 


-7 


0.3 


3.5393 10" 


-1 


3.5393 10" 


-1 


2.6979 10" 


-6 


0.5 


1.2271 10" 


-1 


1.2271 10" 


-1 


2.4466 10" 


-5 


0.7 


2.0429 10" 


-2 


2.0429 10" 


-2 


1.4810 10" 


-5 


0.9 


3.6096 10" 


-4 


3.6094 10" 


-4 


6.0762 10" 


-5 



Next-to-Next-to-Leading order 



10- 7 


1.5287 10" 


-4 


1.5287 10" 


-4 


1.5497 10" 


-5 


10~ 6 


6.9176 10" 


-4 


6.9176 10" 


-4 


5.0711 10" 


-6 


10" 5 


3.0981 10" 


-3 


3.0981 10" 


-3 


9.5455 10" 


-6 


10- 4 


1.3722 10" 


-2 


1.3722 10" 


-2 


1.8022 10" 


-5 


10- 3 


5.9160 10" 


-2 


5.9160 10" 


-2 


5.0631 10" 


-6 


10" 2 


2.3078 10" 


-1 


2.3078 10- 


-1 


2.4853 10" 


-6 


0.1 


5.5177 10" 


-1 


5.5177 10- 


-1 


2.4747 10" 


-6 


0.3 


3.5071 10" 


-1 


3.5071 10- 


-1 


2.8430 10" 


-7 


0.5 


1.2117 10" 


-1 


1.2117 10- 


-1 


3.5893 10" 


-5 


0.7 


2.0077 10" 


-2 


2.0077 10- 


-2 


5.5823 10" 


-6 


0.9 


3.5111 10" 


-4 


3.5109 10- 


-4 


5.8172 10" 


-5 



Table 3: Comparison of the Les Houches parton evolution benchmark results (LH) with our results 
(NNPDF). 
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N it 


(e) 




0-£ 




^max 




# of points with e > 5 10 5 . 


1 


2.4 10" 


-2 


8.5 10" 


-2 


6.6 10" 


-l 


21.3% 


2 


1.3 10" 


-4 


6.2 10- 


-4 


4.0 10- 


-3 


7.2% 


4 


1.6 10" 


-5 


1.5 10" 


-4 


3.3 10" 


-3 


5.4% 


6 


1.6 10" 


-5 


1.5 10" 


-4 


3.3 10" 


-3 


5.4% 



Table 4: Relative accuracy e (see text) in the result obtained for NLO evolution as a function of 
the number of points N pt = 16 (2 Nit+1 — l) used for gaussian integration. 



X 


F 2 NS (x,Q 2 ) (Exact) 


F 2 NS (x,Q 2 )) (Interpolated) 


Rel. error 


Next-to-Next-to-Leading order 


10" 3 


3.7288 10~ 2 


3.7288 10~ 2 


1 io~ 6 


10~ 2 


1.1750 10~ 2 


1.1750 10~ 2 


7 10~ 6 


0.1 


3.3497 10~ 2 


3.3497 10" 2 


1 10" 5 


0.3 


4.4425 10~ 2 


4.4425 10~ 2 


3 10~ 7 


0.5 


2.2159 10~ 2 


2.2159 10~ 2 


3 10~ 5 


0.7 


4.7996 10-3 


4.7996 10~ 2 


6 10~ 5 


0.9 


1.5031 10~ 4 


1.5030 10~ 4 


9 10~ 5 



Table 5: Comparison of the results for exact evolution at NNLO with coefficient function to to 
results obtained interpolated evolution (see text). Evolution is performed from Qq = 2 GcV to 
Q 2 = 230 GeV 2 . The input is taken to be equal to the final central fit discussed in Sect. 6. 
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5. Neural networks 



Neural networks, which we use in order to parametrize parton distributions, are nonlinear 
maps between input ^ and output £^ variables, i.e. they are just a particularly conve- 
nient set of interpolating functions. Like other sets of functions (such as e.g. orthogonal 
polynomials), neural networks, in the limit of infinite size, can reproduce any continuous 
function. However, the standard sets of basis function impose, upon truncation, a specific 
bias on the form of the fitted function: for example, polynomials of fixed degree can only 
have a maximum number of nodes and stationary points, and so on. This makes it difficult 
to obtain an unbiased representation of an unknown function. 

Whereas clearly finite-size neural networks also impose a limitation on the set of func- 
tions that they can reproduce, their nonlinear nature makes it possible to make sure that 
this is not a source of bias. In particular, it is possible to show that stability of the fit 
is obtained, in that increasing the size of the neural network the results of the fit do not 
change. Very crudely speaking, this is a consequence of the fact that the smoothness of the 
fitted function decreases during the fitting process. As a consequence, it is possible to stop 
the fitting procedure before one attains the lowest value of the \ 2 (or other figure of merit), 
when the fit is optimized according to suitable statistical stopping criterion, but before sta- 
tistical fluctuations are fitted. It is then possible to check that the result is independent of 
the size of the neural network: eventually, as the size increases, the best-fit result becomes 
independent of it. This is to be contrasted with standard fits, where one reaches the lowest 
X 2 compatible with the given functional form, and eventually if one increases the size of 
the fitting function no stable fit can be obtained. 

The critical aspects of the construction of a neural network parametrization are thus 
its form, its fitting (also called training) algorithm, and the criterion which is used to stop 
the training at the best fit. We now discuss in turn these points, by presenting both the 
general strategy and specific choices. 

5.1 Structure 

We use multi-layer feed-forward neural networks, schematically depicted in Fig. 6. In these 
neural networks, neurons are arranged into layers I = 1, . . . , L, with j = 1, . . . , n; neurons 
per layer. The output is given by neurons in the last layer, as a function of the output of 
all neurons in the previous layer, which in turn is a function of the output of all neurons in 
the previous layer and so on, starting from the first layer, which provides the input. The 
output £^ of each neuron (j-th neuron of the l-th layer) is given by a nonlinear activation 
function g(x): 



evaluated as a linear combination of the output & of all networks in the previous layers, 




1 = 2,. ..,L, 



(5.1) 




(5.2) 
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Figure 6: Schematic diagram of a feed- forward neural network. 



where U{j (weights) and 9i (thresholds) are free parameters to be determined by the fitting 
procedure, and g(x) is taken to be a sigmoid in the inner layers, 

= i - ! —/ \ ' ( 5 - 3 ) 

1 + exp(— x) 

and linear g{x) = x for the last layer. In practice, it turns out to be convenient to rescale 
both the input and the output of the neural network, in such a way that they both take 
values between and 1: this avoids having weights ufy- whose numerical values span many 
orders of magnitude. We have explicitly checked independence with respect to reasonable 
variations of this rescaling. 

In short, the neural network outputs the values Q as a function of the input values 
and the parameters Uij, 0%. The training of the neural network consists of the determination 
of the best-fit values of these parameters given a set of input-output patterns (data). 

As discussed previously in detail [9] , the choice of the architecture of the neural network 
cannot be derived from general rules and it must be tailored to each specific problem. One 
can roughly guess the size of the neural network based on rules of thumb, then verify by 
actual fitting which is the critical size above which results become independent of the size 
of the networks. Finally, one chooses a size which is somewhat larger than the critical one, 
in order to make sure that results are unbiased. 

In previous work [9], it was found that a neural network with two hidden layers and 
architecture 4-5-3-1 is adequate for a fit of the nonsinglet structure function F2(x,Q 2 ). 
The neural network has four inputs because it turns out to be more efficient to take 
simultaneously as input both x, Q 2 , and their logs In a;, InQ 2 . In the present case, the 
neural network only fits the nonsinglet quark distribution at the initial scale, so only two 
input neurons for x and Inx are required. We then retain the same large architecture, 
which is surely redundant given that now only a function of a single variable is fitted. This 
leads us to the architecture 2-5-3-1. Stability of the results upon this choice will be tested 
in Sect. 6.3. 

The neural network with this structure can be used to parametrize the nonsinglet 
quark distribution directly. However, it turns out to be convenient to actually relate the 
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neural network to the quark distribution through a suitable factor. This can be viewed 
as preprocessing of the data: if the function to be fitted is known to be dominated by 
some behavior in a given region, it may be convenient to only fit the deviation from this 
behavior. 

In our case, it is known that the PDF has to satisfy the kinematical constraint (?ns {x = 
IjQo) = 0- Standard counting rules [32] as well as existing parton fits further suggests 
that the structure function drops as a power as x — > 1 (up to logarithmic corrections): 
typically q^s(x = l,Qo) ~ (1 — x) 5 . This suggests that it might be useful to divide 

x — >1 

out this expected leading behavior as x — > 1. The vanishing constraint at x = 1 is then 
implemented automatically, which is more efficient than the solution adopted in previous 
work [9,10], where the constraint was adopted by adding a Lagrange multiplier, i.e., in 
practice, artificial points at F^{x = 1, Q 2 ) = with various values of Q 2 . 

Furthermore, the output of the neural network is by construction bounded. The last 
layer is a linear function of its input, which in turn has as its largest value the sum 
of weights, given that ^ l ^ in turn are bounded, ^ < 1, according to Eq. (5.3). 
However, both current parton fits as well as Regge theory arguments [32] suggest that 
?ns(^)Qo) bl° ws up as n 0. Even though, of course, this growth is cut off by the fact 
that the data only extend down to a minimum value of x, and not to x = 0, this behavior 
leads to substantial variation of the PDF in the measured region as x decreases, and it is 
therefore advantageous to divide it out. 

Therefore, we parametrize the nonsinglet parton distribution as 

9Ns(x,Qg) = ^^NN(x) , (5.4) 

where NN(s) is the neural network discussed above. Independence of the results on the 
various choices discussed in this section, in particular architecture and pre-processing, will 
be checked explicitly and discussed in the next section. 



5.2 Training 

The procedure of fitting (or training, as it is usually called in the context of neural networks) 
is based on the minimization of a suitable figure of merit. As discussed in Sect. 2, we fit 
to each replica of the data a nonsinglet quark distribution by maximum likelihood, i.e. we 
minimize the error function 

E W [ U ] = J_Y ( F (^)(k) _ F (net)(k)\ f^yA />(art)« _ F (net)(k)\ ^ ^ 

-Ndat ■ - , ^ * S \ J ij \ 3 3 J 

1,3=1- 

where F^ net ^ is determined in terms of the nonsinglet quark distribution at the reference 
scale Qq by Eq. (4.11), and the quark distribution at the reference scale is given by a neural 
network Eq. (5.4). The covariance matrix cov^ fc ^ is defined in Eq. (3.7) and it does not 
include normalization errors, as discussed in Sect. 3.2. Note that is a property of each 
individual replica, whereas the quality of the global fit is given by the x 2 computed from 
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the averages over the sample of trained neural networks, namely 




1 



i,j=i 



(5.6) 



where now the covariance matrix, defined in Eq.( 3.1), includes normalization uncertainties. 

The error function Eq. (5.5) can be minimized with a variety of techniques, including 
standard steepest-descent in the space of parameters. The back-propagation method, 
which was used in Refs. [9, 10] and is often advantageous for neural network training, 
cannot be adopted here because the measured quantity (the observed value of the structure 
function) depends non-locally on the neural network, i.e. it depends on the value of the 
parton distribution for several values of the input variable x. It turns out to be convenient 
to use a genetic algorithm [33] as a minimization method, as already done in Refs. [10,11]. 

The genetic algorithm applied to our problem works in the following way, for each 
replica and repeating the whole procedure for each replica, i.e. N Tep times (we shall omit 
for simplicity the index k which identifies the individual replica). The state of the neural 
network is represented by the weight vector 



where each element tOi corresponds to a weight lo\- or threshold 0j, and N par indicates the 
total number of weights nd thresholds. A set of iV mu t copies of the state vector is then 
generated. Minimization is performed in steps usually referred to as generations (training 
cycles in ref . [10]). At each generation, a new set of copies of the weight vectors is obtained 
by means of two operations. First (mutation) a randomly chosen element of the state vector 
uj) is replaced by a new value, according to the rule 



where r is a uniform random number between and 1 and rj (mutation rate) is a free 
parameter of the minimization algorithm which can be optimized either for the given 
problem, or dynamically during the minimization. Second (selection) a set of vectors with 
low values of the error Eq.( 5.5) are selected out of the total population of N mut individuals, 
and use to replace the original vector. The simplest option is to select the iV sel < iV mu t 
vectors with lowest error. Methods based on probabilistic selection, such as those used 
Ref. [10], explore more efficiently the space of possible mutations. Moreover, in order 
to avoid the local minima and increase the training speed, we have introduced multiple 
mutations. Specifically, we have found that one additional mutation with probability 50% 
and two additional mutations with probability 20% produce a significant improvement of 
the convergence rate. 

The procedure is iterated until the vector with smallest value of the error function of the 
set for each generation meets a suitable convergence criterion, to be discussed in the next 
subsection. Note that at each generation the value of the error function never increases. 
The main advantage of genetic minimization is that it works on a population of solutions, 



U) = (wi,W2,...,WiV pBr ) • 



(5.7) 




(5.8) 
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rather than tracing the progress of one point through parameter space. Thus, many regions 
of parameter space are explored simultaneously, thereby lowering the possibility of getting 
trapped in local minima. Usually, this procedure is further supplemented by crossing within 
the chains of mutated parameters (see e.g. Ref. [10]). However, this is not necessary in 
our case since, thanks to a dedicated preprocessing, mutations alone are enough to reach 
satisfactory minima. Moreover, the use of Monte Carlo replicas should avoid hitting the 
same minimum twice. 

Here, genetic minimization is applied to an initial set of parameters chosen at random. 
In fact, in order to fully explore the space of parameters, it turns out to be advantageous 
to first choose at random the range of parameters, and then their values in this range. In 
practice, the range is chosen at random between 

[-<w)-<7 W) (w) + <7 w ] , (5.9) 

and 

[-{u) + a u ,{w)-a u ] , (5.10) 

where (cj) and are the average and variance of the weights computed from the set of 
best-fit networks obtained in a previous fit. The value of the mutation rate is chosen to 
be i] = 8, and the probabilistic algorithm of Ref. [9] is adopted. While these choices are 
optimized to obtain fast convergence, we have checked that they do not influence the final 
results. 

Finally, because we have to deal with data sets coming from different experiments and 
with different features, it turns out to be advantageous to adopt weighted training, in order 
to ensure that the fit to all experiments is of comparable quality. To this purpose [10], 
more weight is given to experiments with a larger value of E. Namely, the function E m - m 
to be minimized is given by 

j JVexp 

£min = Pj N d3.t,jEj , (5.11) 

3=1 

where N^tj is the number of data points and Ej the value of the error function defined in 
Eq. (5.5) but restricted to the points coming from the j — th experiment, and the weights 
Pj are adjusted dynamically according to 

Pi = (#") > (5-12) 

\ "max / 

where E max is the highest amongst the values of Ej. In order to avoid introducing ar- 
tificial instabilities, the re-weighting is only used if the ratio on the right-hand side of 
Eq. (5.12) differs significantly from one. In practice, no reweighting is implemented if 
Cmin < E NM c/Ebcdms < c ma x, with c min = 0.78 and c max = 1.22. 

5.3 Stopping 

As we explained already, the crucial feature which guarantees a bias-free fit is the possibility 
of stopping the training not at the lowest value of the figure of merit (which might depend on 
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the functional form of the fitting function), but rather when suitable criteria are met. These 
criteria should single out the point where the fit reproduces the information contained in 
the data, but not the statistical noise. Namely, the best fit should have the lowest possible 
value of the figure of merit compatible with the requirement of not overlearning, i.e. not 
fitting statistical fluctuations. In previous work [9, 10], this was done by determining an 
optimal training length. The present case is more subtle, because on the one hand the fitted 
function is much more constrained by the data, on the other hand, the quantity which is 
being fitted is not directly observable. This makes it harder to distinguish overlearning 
from an actual improvement in the fit. 

We therefore adopt the following criterion. We separate the data set into two disjoint 
sets. We then minimize the error function, Eq. (5.5) computed only with the data points 
of the first set (training set, henceforth). During the minimization process, we compute 
the error function from the data of the second set (validation set, henceforth). The best 
fit has been reached when the validation error function ceases to decrease. Overlearning, 
in particular, corresponds to a situation where the training error function keeps decreasing 
while the validation one does not, thereby signaling that the fit is merely reproducing the 
fluctuations of the specific training set. The procedure is reliable and it does not lead to 
loss of information if both the training and the validation set reproduce the features of 
the full data set. This can be simply achieved on average by choosing a different random 
partition for each replica of the data, and then checking that the number of replicas is 
sufficiently large. Similar methods have been widely used in various applications of neural 
networks, and in high-energy physics e.g. in Ref. [34]. 

In practice, we implement this stopping criterion as follows. We define a training frac- 
tion /t r = A^tr/^dat (with default value /tr = 0.5) and for each replica we select at random 
a fraction / tr of points for each experiment, which we use for training, while the remaining 
points are assigned to a validation set. We then compute separately the corresponding 
training and validation covariance matrices. The training error function E$ Eq. (5.5) is 
minimized using a genetic algorithm, and at each generation of the genetic minimization 
the validation error function E^}{1) is computed as a function of the generation index I. 
Having chosen a fixed fraction of points separately for each experiment allows for weighted 

(k) (k) / 

training. Both E tr and E^J(l) are computed neglecting cross-correlations between data in 
the validation and the training set: we have verified that this is a small correction anyway. 

(k) 

We then let the training proceed at least until E tr ' has reached the threshold value 

(k) 

EL' = E m \ n . We choose as default the value E ni \ n = 3 . This ensures that the training 

(k) 

does not stop due to fluctuations in the early stages. Beyond this point, i.e. if EL < E m { n , 
the training is stopped at the Z-th generation if the training error function decreases, 

(Etr(l)) 



{Etr (I ~ N sm )) 

while the validation error function 



< 1 (5.13) 



where 

(42) (0 = ^ E ^(0, (5.i5) 

sm j=/_AT sm+ l 

with the validation error function for the i— th GA generation, and similarly for 

the training error. In other words, the training is stopped if the value of the validation 
error function averaged over N sm generation starts increasing, while the training error 
function decreases. This last condition must be imposed (despite the fact that with a 
genetic algorithm the figure of merit always decreases) because of weighted training: when 
the weighting is readjusted the error function could increase locally. We take N sm = 4 as a 
default value. This averaging of the figures of merit along the training is analogous to the 
determination of moving averages, widely used especially in financial dynamics. 

Finally, we have introduced an upper length to the training process, i.e. a maximum 
number of generation N gen , chosen to correspond to a very long training such that the 
validation error function is no longer expected to decrease. The default value is -/V gen = 800. 
In practice, the criterion condition (5.14) was always met well before this point except in 
a tiny number of cases. 

6. Results 

We present our best-fit result for the nonsinglet quark distribution and its statistical fea- 
tures, and in particular we show that it provides a consistent estimate of both central 
values and errors. We will then show its stability upon variation of the fitting procedure. 

6.1 Next-to-leading order results: central values 




x 



Figure 7: The best-fit NLO nonsinglet quark distribution q^s(x,Ql) in the large- x region. The 
MRST, CTEQ and Alekhin determinations are also shown for comparison. In this and subsequent 
plots of qns we take Q 2 — Ql = 2 GeV 2 . 
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X 



Figure 8: The best-fit NLO nonsinglct quark distribution Qns(^,Qo) m the small- X region. The 
MRST, CTEQ and Alekhin determinations are also shown for comparison. 



The result obtained from a set of N Tep = 1000 neural networks trained to replicas of the 
data with the method described above, using the NLO expressions of the coefficient function 
and anomalous dimension is displayed in Figs. 7 and 8. The error band corresponds to a 
one-sigma contour. The distributions of values of the error function E Eq. (5.5) for the 
training and validation samples are displayed in Fig. 9. These distributions are poissonian 
(gaussian) to good approximation, and both are peaked around a similar value, thereby 
showing that the quality of the fit for points included or not included in the fit is equally 
good. The distribution of training lengths is shown in Fig. 10. It is poissonian to good 
approximation, and it shows that convergence is reached after a small number of generations 
of the genetic algorithm, much smaller than the maximum iVg en which is only reached in a 
tiny number of cases. 
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Figure 9: Distribution of values of the error function E Eq. (5.5) at the stopping point for the 
training and validation samples (left), and for the full data sample (right). 
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Distribution of training lenghts | 



Training lenght 



The distribution of values of the error 
function E Eq. (5.5) for the total data set 
is displayed in Fig. 9, while the statistical 
properties of the fit are summarized in Ta- 
ble 6. Note that, because the quantities 
listed in the table characterize the compar- 
ison of the fit to the data (see Appendix B 
for the definitions), they are determined from 
the structure function rather than the 
quark distribution itself. The error func- 
tions of each fit are distributed to good ap- Figure 10: Distribution of training lengths, 
proximation in a gaussian way about the 

value (E) = 2.27 given in Table 6. A value (E) w 2 is expected for a good fit because 
if errors are correctly estimated, the average standard deviation of true values about the 
measured value should equal the total uncertainty, but replicas are generated about the 
measured values so their average standard deviation about the true values should be twice 
the total uncertainty. However, the fit to each individual replica should be closer to the 
true value, because of the availability of many data points at various values of Q 2 which are 
combined by the fit. Indeed, the average error of the fit is much smaller than the average 



experimental error: (a 



(net)^ 



' dat 



<C (cr (cxp) } dat (see Table 6). 



The fact that the figure of merit for the global fit Eq. (5.6) takes the value x ~ 1 
shows that the central value of the fit, after averaging over replicas, is distributed about 
the measured values as one expects of the correct distribution of true values, namely, with 
variance equal to the experimental error (which is much larger than the error on the fit 
itself as we just discussed). Note that, strictly speaking, the total number of independent 
degrees of freedom is somewhat smaller than the total number of points. However, the 
number of effective free parameters of the neural network is rather smaller than the total 
number of weights and thresholds, because our choice of stopping criterion ensures that 
the neural network is redundant. Hence, the number of degrees of freedom is at most 
by a few percent smaller than A^at (and the correctly estimated \ 2 accordingly larger), 
which is smaller than the uncertainty on \ 2 due to statistical fluctuations, which is of 
order of N. V 2 . However, as a consequence of this, we expect the value of xV-^dat to be 
systematically lower than one by a few percent for an optimal fit. This expectation will be 
borne out by our results. 

We conclude that the central fit is correctly estimated. 



6.2 Next-to-leading order results: uncertainties 

Because experimental errors are much larger than the error on the fit, the values of (E) 
and x 2 discussed in Sect. 6.1 do not test for the accuracy of the one-sigma error band on 
the central fit. We can check for the correctness of the latter by studying the fluctuation of 
predictions obtained from different sets of replicas. To this purpose, we note that we can 
view the prediction qi obtained for the central values of the quark distribution by averaging 
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Total 


NMC 


BCDMS 


V 2 

A tot 

(E) 


U. / 
2.27 


n vo 
U. / Z 

1.99 


V. (o 

2.52 




0.81 


0.66 


0.95 


/ ^.(cxp) \ 
\ a( j /dat 

<- (DCt) > da t 

r [cr( net )] 


0.011 
0.006 

0.59 


0.01 / 
0.009 

-0.04 


0.006 
0.004 

0.86 


/ _(cxp) \ 

</> (DCt) ) da t 
r [p( net )] 


nil 

0.46 
0.15 


U.oy 
0.42 
0.25 


U.lO 

0.50 
0.04 


( COv(eXP, >dat 

< COv(nCt) )dat 
r [cov( nct )] 


8.6 10- 6 
2.1 10~ 5 
0.24 


1.0 10" 5 
3.8 10~ 5 
0.23 


7.2 10" 6 
6.9 10~ 6 
0.57 



Table 6: Statistical estimators for the final sample of N rcp = 1000 neural networks, both for the 
total data points and for the individual experiments incorporated in the fit (NMC and BCDMS). 



over iVrep replicas, as a random variable, whose variance is given by 

Vh] = (6.1) 

in terms of the variance cr 2 . of the set of replicas. We can define an average distance d[q] as 
the root-mean square difference between values of qi obtained by averaging over different 
set of replicas, normalized to the variance Eq. (6.1) (see appendix B). If the error on q 
is correctly estimated, the average of d[q], determined for different points, must approach 
one. We can similarly view the standard deviation of each data point <r 2 ., and test for their 
statistical accuracy by computing the corresponding distances. 

Hence, we test for accuracy of estimates of er- 
rors and correlations by computing this distance 
for the quark distribution qi (determined as an av- 
erage over neural networks) and the error on it a^. 
(determined as variance of the networks) . The dis- 
tance is determined between results obtained from 
a set of N rcp replicas each. The computation is Table 7: Stability estimators as a func- 
performed for 14 points linearly spaced in x in the tion of thc numbcr of trained replicas 
data region (0.05 < x < 0.75) and then averaged, ^ r °P' 
and similarly for the same number of points loga- 
rithmically spaced in the extrapolation region (10~ 3 < x < 10~ 2 ). In order for the result to 
be significant we must make sure that its standard deviation ad is not too large. However, 
because d is a standard deviation itself (normalized to its expected value), ad only scales 
as n -1 / 4 with the number of points n, so ~ 0.5. On the other hand, increasing the 
number of points will not help because, of course, two values of q(x) and q{x') are highly 
correlated when x and x' are close to each other. Thus, we obtain stability by repeating the 
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computation for different random choices of two sets of N Tep replicas among the starting 
2N rep ones, and averaging the results, which reduces the standard deviation according to 
Eq. (6.1). We average over 1000 sets, so that o<i ~ 0.01. 

Results with increasing values of iV rC p are shown in Table 7. The distance already 
converges to the expected value of one, to percent accuracy, when iV rC p = 10. Note that, 
because of Eq. 6.1, the expected distance decreases as J- — : the stability of the result 

we get when the number of replicas is further increased from 100 to 500 shows that this 
behaviour is indeed observed in our sample of neural networks. 

6.3 Stability upon variation of the neural network structure 

After verifying that our results are statistically consistent, we wish to check that they 
do not depend on the details of the fitting procedure. We do this by determining the 
distance discussed in Sect. 6.2, but now computed for a pair of fits which differ in fitting 
procedure. Independence is verified if the distance equals its value predicted on statistical 
grounds (i.e. that which would be obtained from pairs of replicas coming from the same 
fit). Specifically, the computation is done by taking N rep = 50, and then averaging over 
1000 different choices of these 50 replicas among a total set of iV rC p = 100 replicas. Note 
that, because of Eq. 6.1, this verifies independence of the fitting procedure to an accuracy 
which is by a factor \/50 smaller than the uncertainty on the central value. 

The first check is to verify that our results are indeed 
independent of the neural network architecture. We have 
compared the results of the fit with the reference architec- 
ture 2-5-3-1 with the results of a similar fit with a smaller 
neural network architecture 2-4-3-1. The result of this com- 
parison can be seen in Table 8. This confirms that we have 
achieved independence of the number of parameters used, 
both in the data and in the extrapolation region, a property Table 8: Independence of the 
that is very difficult to achieve in standard parton fits with architecture of the neural net- 
fixed functional form. works. 

Next, we estimate the dependence of the results on the different kind of preprocessing, 
namely, on the values of the exponents m and n 

qm(x,Q 2 ) = [ - J> NN(x) , (6.2) 

when varied about the default values m = 3 and n = 1. In Table 9 we display the 
dependence of the x 2 , computed from a set of N rep = 100 replicas, and the stability of the 
fit when 1 < m < 4 (with n = 1 fixed) and when 0.5 < n < 1.25 (with m = 3 fixed). It 
appears that the fit is reasonably stable if 2.5 < m < 3.5 and 0.5 < n < 1.25: central values 
in the data region fluctuate less than three a while the uncertainty is stable. The central 
fit starts changing in a statistically significant way when m and n are varied outside these 
bounds. However, in such case the quality of the fit, as measured by the x 2 i deteriorates 
considerably. Hence, we conclude that the fit is independent of the choice of m, n provided 
these are kept within a range which allows for an optimal quality of the fit. 
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Figure 11: Dependence of the results on the preprocessing exponents m, n, Eq. (6.2) 
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Table 9: Dependence on the preprocessing exponents. 



Fits within the stability region of the preprocessing exponents are compared in Fig. 11. 
Stability of central values and errors in the data and extrapolation region are apparent. 
Note that the distance Eq. (6.1) is computed with iV re p = 50, hence the distance of the 
central values equals about 15% of the error band displayed in the plot. 

7. Theoretical uncertainties 

So far we have checked that our fit reproduces correctly the information contained in the 
data, and thus that the error band computed from it is statistically consistent. A priori, 
there are further sources of theoretical error related to use of perturbative QCD in the 
fitting procedure: specifically those related to higher order perturbative corrections and to 
higher twist terms. In this section we will show that these theoretical errors are in fact 
negligible. All tests in this section, as in Sect. 6.3 are done by computing the \ 2 from a 
set of 100 replicas, and the distance by taking N rep = 50, and then averaging over 1000 
different choices of these 50 replicas among a total set of N rev = 100 replicas. 
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0.08 




Figure 12: Results for xq^s(x, Qq) with two different kinematical cuts in Q 2 . 
7.1 Dependence on kinematic cuts 

We first discuss the dependence of results for the nonsinglet parton distribution q^s(x, Q 2 ) 
on the kinematical cut in Q 2 . In Fig. 12 and Table 10 we compare results obtained when the 
value Q^in is raised to Q^ n = 9 GeV 2 , to results obtained with the default value (Jmin = 3 
GeV 2 . Removing the data points in the region 3 GeV 2 < Q 2 < 9 GeV 2 eliminates the bulk 
of the NMC measurements, i.e. (see Fig. 2) essentially all data with x < 0.1. 

As a consequence, uncertainties increase sizably in the small- x region. However, results 
are remarkably stable in the large x region. In Tab. 10 we compare results in three different 
kinematical regions: where the two fits with different kinematical cuts have data (region I, 
0.2 < x < 0.75), where only the fit with Q 2 min = 3 GeV 2 has data (region II, 0.05 < x < 0.1) 
and finally the region where both fits extrapolate (region III, a; < 10~ 2 ). Not only in region I 
the two fits agree completely, but even in region II, whereas the uncertainty bands increases 
considerably, the central fit is essentially unaffected. This proves the remarkable stability 
of our results. 





x < 10" 2 


5 10~ 2 < x < 0.1 


0.2 < x < 0.75 
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0.6 


0.8 
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(d[a}) 


2.7 
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1.4 



Table 10: Comparison of fits with different kinematical cuts. The fit with (3 2 uin = 3 GeV 2 
(reference) is compared to the fit with = 9 GeV 2 in three different regions of x. 

7.2 Higher twists 

We now test for the possible role of higher twist terms, on top of the target-mass corrections 
which are already included in our default fit as discussed in Sect. 4.4. To this purpose, 
we switch on a twist-four contribution according to Eq. (4.39), and parametrize HT{x) 
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Table 11: Results of the higher- twist analysis. Stability is computed relative to the default NLO 



with a 1-2-1 neural network. In Table 11 we compare the \ 2 obtained by including this 
higher twist term to that obtained by raising the cut in Q 2 . On the one hand, we see that 
including higher-twist corrections without raising the Q 2 does not lead to any improvement 
in the quality of the fit. On the other hand, we see that if the Q 2 is raised, the \ 2 also does 
not improve — in fact it deteriorates slightly, in a way which is unaffected by the inclusion 
of higher twist terms. 

We take the lack of improvement in \ 2 when either the cut is raised, or higher twist 
terms are included, as an indication of the fact that there is no evidence whatsoever for 
higher twist terms even with our lower default cut. The slight deterioration in \ 2 when 
the Q 2 cut is raised is likely to be related to the fact that experimental uncertainties in 
the small-Q 2 region (i.e. experimental uncertainties on NMC data) are somewhat overes- 
timated. 

We conclude that there is no evidence for higher twists for nonsing let F 2 NS data with 
Q 2 > 3 GeV 2 . 

7.3 Higher orders and the value of the strong coupling 

So far we have discussed the determination of the nonsinglet parton distribution through 
a NLO analysis. We now compare these results to those obtained at one less and one more 
perturbative order. Hence, we produce a full LO fit, with a s [M\ ) = 0.130 and a full 
NNLO fit, with a s (M|) = 0.115. The results of these fits are compared to the NLO ones 
in Table 12. No variation of the quality of the fit is found between different perturbative 
orders: the values of \ 2 an d (cr( nct )) d are unaffected by the perturbative order at which 
the computation is performed. Indeed, only very small improvements of the x 2 when going 
from LO to NLO and NNLO have been found in recent nonsinglet fits [35,36]. 

This means that the effects of higher-order corrections are entirely reabsorbed in a 
change of the initial quark distribution. This change is statistically significant when going 
from LO to NLO: the central values moves by d « 10 (see Tab. 12), i.e. by about half 
the error on the central value. Indeed, we have verified that if Tab. 12 is recomputed 
using a set of N iep = 50 replicas, we get (d[q]) dat = 3.2 for the LO fit. This means that 
the LO and NLO central values differ by a fixed amount, independent of the number of 
replicas used to determine the central value, therefore if we normalize to the standard 
deviation Eq. 6.1 the result grows as y^rcp- No statistically significant difference in 
central values is observed when going from NLO to NNLO. Hence, our result supports the 
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Figure 13: Comparison of a LO fit (left) and a NNLO fit (right) to the reference NLO fit. 
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Figure 14: Comparison of the reference fit to fits with a lower (left) or higher (right) value of 
a s (M|) 



conclusion [35] that nonsinglet data are insufficient to provide evidence for higher-order 
perturbative corrections. These results are apparent in Fig. 13, where the LO, NLO and 
NNLO fits are compared. 
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Table 12: Quantitative comparison of fits at different perturbative orders. The distance is com- 
puted relative to the NLO fit using a set of iV r0 p = 500 replicas. 

The fact that NLO corrections have a negligible impact on the fit suggests that the fit 
is only weakly dependent on the value of a s (M§) . We have repeated our NLO fit while 
varying the value of a s (Af|) by one sigma about its central value, i.e. with a s (M§) = 
0.116 and a s (M§) = 0.120. Comparison to the central fit (Fig. 14 and Tab. 13) shows 
that the variation of the central fit due to the change in a s (M§) in this range has marginal 
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Table 13: Quantitative comparison of fits at NLO with different values of a s (M|). 

statistical significance: (d [q])^ at V^dat ~ 0.2, i.e., the central fit moves by about a fifth of 
a standard deviation when a s (M|) is varied in this range. This variation, though small, 
appears to be stable: if 50 replicas are used we get (d [q]) dat = 1-3 for a s (M|) = 0.116 and 
(d [q]) dat = 1-1 for a s (M§) = 0.120, corresponding to the same value of (d [q]) dat V^dat ~ 
0.2. The variation of the \ 2 in this range is also marginal and it does not show a clear 
trend. We conclude that a NLO determination of a s based on nonsinglet data only is in 
principle possible, but it is necessarily affected by an uncertainty which is sizably larger 
than the error 0.002 on the PDG average. A smaller value for this uncertainty would be 
a likely indication of an underestimate of the uncertainty related to the choice of parton 
parametrization. 

8. The structure function 

We finally turn to results for the physically measurable structure function F|^ s (x,(5 2 ), 
which we compare to the data and to results obtained in various other approaches. 

8.1 Comparison to data and other parton fits 

In Figs. 15 and 16 we show a comparison of our results for the nonsinglet structure function 
i ? 2 fS (x, Q 2 ) as a function of x at fixed Q 2 = 15 GeV 2 compared to all data with 13 < Q 2 < 
17 GeV 2 , and as a function of Q 2 at fixed x = 0.15, compared to all data in the range 
0.13 < x < 0.17. In the same figure, we also show the structure function obtained from 
the results of the global fits of the CTEQ [5] and MRST [6] collaborations and the fit to 
deep-inelastic data by Alekhin [4]. 

The uncertainty in the final structure function is much smaller than that on the data, 
thanks to the theoretical information from perturbative evolution. Nevertheless, the un- 
certainty in our determination, especially at small x, is rather larger than that found in 
previous analyses. One may wonder whether this is due to the fact that the global analyses 
are based on a wider set of data. However, in the small x region, where the difference 
between the uncertainty on our result and that found by other groups is most striking, 
none of the data contained in these global fits can constrain q^s (x, Qq), on top of the data 
on F2 S (x,Q 2 ) which we also include. This suggests that the smaller error band obtained 
by these groups may be due to parametrization bias or uncertainty underestimation. 
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Figure 16: Our NLO determination of the nonsinglet structure function F|^ s (a;,Q 2 ) as a function 
of Q 2 for x = 0.15, compared to all data in the range 0.13 < x < 0.17. Determinations by other 
groups are also shown. 



Be that as it may, wheras all available fits are within our error band, our central result 
disagrees with these fits, which in turn disagree with each other within respective errors, 
even in the valence region 0.1 < x < 0.3. 

Finally, it is interesting to compare our NNLO determination of the nonsinglet parton 
distribution to the results of some recent NNLO fits to nonsinglet DIS data [35, 36] (see 
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Figure 17: Comparison of our NNLO fit with the results of the nonsinglet fits of Refs. [36] and [35] . 

Fig. 17). Results are analogous to those found when comparing to global fits. 
8.2 Comparison to neural F^ s (x,Q 2 ) 

In Ref. [9] the nonsinglet structure function F^ s (x,Q 2 ) was directly parametrized with 
neural networks as a function of its two arguments, instead of being obtained by evolving 
a parton distribution. The results of that analysis are compared to the present NLO 
determination in Fig. 18. The comparison shows that the uncertainty on the determination 
of Ref. [9] is already smaller than that of the data, because of the requirement of smoothness 
of the function which is fitted to them, as discussed in detail in Ref. [9]. However, thanks 
to the extra information from perturbative evolution, the uncertainty is further reduced by 
a considerable amount when fitting the structure function. Also, it is apparent that thanks 
to this error reduction, a more detailed determination of the shape of F^ s is possible. We 
also display the pull (see Appendix B) between these two determinations, which shows 
that they are in perfect agreement within one sigma in the region where there are data: 
whenever the pull increases in modulo above one, no data are available. 

9. Conclusions and outlook 

In this paper we have provided a first determination of a parton distribution within an 
approach based on the use of neural networks coupled to a Monte Carlo representation of 
the probability density in the space of structure functions. 

Our final results take the form of a Monte Carlo sample of neural networks. The non- 
singlet parton distribution q^s ( x ,Qo) an d its statistical moments (errors, correlations and 
so on) can be determined by averaging over this sample. These results together with driver 
FORTRAN code can be downloaded from the web site http:/ /sophia. ecm.ub.es/nnpdf/. 

The main distinctive features of our determination of the isotriplet quark distribution 
when compared to that obtained in other fits are the following: 
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Figure 18: Comparison of the structure function _F'^ ,s (x,Q 2 ) obtained from our NLO q^six, Qo) 
with the NNPDF parametrization of F|^ s (x, Q 2 ) at Q 2 = 15 GeV 2 : structure function (left) and 
pull (right). The pull at Q 2 = 55 GeV 2 is also shown (lower curve at large x). 



• Our fit is demonstrably independent of the parametrization, and the uncertainty on 
it has the correct statistical behavior which one expects of an unbiased statistical 
estimator. 

• The uncertainty is greatly reduced in comparison to that on the data, and also to 
that obtained in fits [9] which do not exploit the constraint of perturbative parton 
evolution. 

• The error band on our result, especially at small x, is rather larger than that of other 
fits, and in fact it includes these other fits even when they disagree with each other. 

• The quality of fits at LO, NLO and NNLO is the same: difference in perturbative 
order can be reabsorbed in the choice of boundary condition. 

Our approach is designed to tackle some of the most serious problems that plague 
current global parton fits, namely, the difficulty of keeping under control the bias due to 
the choice of parametrization, problems in combining different data sets, and the problem 
of determining faithfully the uncertainty on the final result. This first application to the 
determination of a single parton distribution from all structure function data shows that 
the approach is viable and successful. Its ultimate success or failure will be in its application 
to a global parton fit from all available data. 
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A. Large x behavior of the evolution factor 



At leading order in a s (Q 2 ) , in the large x limit, the dominant contribution to the evolution 
kernel comes from the large N limit of the LO anomalous dimension, given by 



' c 2™ \a s (Ql) 



(A.l) 



The inverse Mellin transformation can be performed to all logarithmic orders using the 
formulas of Ref. [37], with the result 



r(x) = £ 
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where A (17) = l/r(r/). If Q 2 > * ne exponent of 1 — x in Eq. (A.2) is larger than one, 
thereby ensuring that, thanks to the + prescription, T(x) is integrable. This in particular 
implies that all integrals in Eq. (4.31) exist. Note that in the Q 2 —> 00 limit the A prefactor 
vanishes, consistent with asymptotic freedom. 



B. Statistical estimators 



We define statistical estimators that are used to asses the quality of both the Monte Carlo 
replica generation (replica averages) and the neural network training (neural network av- 
erages and statistical distance). The superscripts (dat), (art) and (net) refer respectively 
to the original data, to the A^ rep Monte Carlo replicas of the data, and to the -/V rep neural 
networks. The subscripts rep and dat refer respectively to whether averages are taken by 
summing over all replicas or over all data. 



• Replica averages 

- Average over the number of replicas for each experimental point i 
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- Associated covariance 
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We define analogously {PE (cr^)} 
Scatter correlation: 
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where the scatter variances are defined as 
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We define analogously r [cj( art ^ , r [p( art )] and r [cov( art )] . Note that the scatter 
correlation and scatter variance are not related to the variance and correlation 
Eqs. (B.2)-(B.4). 

Average variance: 
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We define analogously (p (art ' ) ) dat an d (cov^ 81 *^ t , as wen as the corresponding 
experimental quantities. 

• Neural network averages 

— Average error over networks 



(B.10) 



re P k=i 



where E^ is given by Eq. (5.5). 
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— Average over neural networks for each experimental point i 
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where i^ net ) is the value of the nonsinglet structure function computed using 
Eq. (4.11) from the neural network parametrization of the nonsinglet quark 
distribution at the values of x and Q 2 which correspond to the i-th data point 
- Associated variance 
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The corresponding means over the iV dat data points are computed using the 
analogue of Eq. (B.9). 
Scatter correlation 
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We define analogously (p^ net ^) dat an d ( cov ^ net ^)dat' 
• Distance between samples 

- Distance between different sets of neural nets 
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where is the value of the nonsinglet quark distribution for the i-th. data point 



computed for the A;-th replica from a set of N^p neural networks, and likewise 
value computed for the k-th replicf 

networks. 



(2) " (2) 

q ik is the value computed for the k-th replica from a second set of iV r ep neural 
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— The distance between errors and correlations is defined using Eq. (B.16), but 
with the average value Eq. (B.17) replaced by the variance 




(B.19) 



and similarly for the covariance. Note that the distance is just the square root 
of the sample variance normalized to the variance of which it is an unbiased 
estimator (see Ref. [15], in particular for a discussion of the variance of d). 



• Pull 



If F2 (x, Q 2 ) and F 2 (x, Q 2 ) are two different determinations of the structure func- 
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